
gh-144888: Replace bloom filter linked lists with contiguous arrays to optimize executor invalidation performance #145873

Open
cocolato wants to merge 2 commits into python:main from cocolato:gh-144888

Conversation

@cocolato
Contributor

@cocolato cocolato commented Mar 12, 2026

During JIT compilation, when function objects are destroyed or code objects are modified, all executors must be traversed to inspect their dependencies, and the affected executors must be invalidated. The original implementation stored executors in singly linked lists, which caused many pointer dereferences during traversal and consequently poor CPU cache efficiency.

This PR changes the executor storage structure from a linked list to a contiguous array, reducing pointer jumps during traversal to improve CPU cache efficiency. It also implements O(1) deletion using swap-remove, thereby accelerating dependency invalidation operations.
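The swap-remove idea is simple: overwrite the removed slot with the last element and shrink the array, so no elements need to shift. A minimal sketch of the technique (the `bloom_array_idx` field name follows the PR; `entry_t` and `swap_remove` are illustrative, not the PR's actual API):

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical record: each entry stores its own index so the owning
 * object can be patched when the entry is swapped into a vacated slot. */
typedef struct {
    int32_t bloom_array_idx;  /* index of this entry in the array */
    int     id;               /* payload, stands in for the executor pointer */
} entry_t;

/* O(1) deletion: move the last entry into the freed slot, fix up the
 * moved entry's stored index, then shrink the logical length. */
static void
swap_remove(entry_t *arr, int32_t *count, int32_t idx)
{
    arr[idx] = arr[*count - 1];       /* overwrite with the last element */
    arr[idx].bloom_array_idx = idx;   /* patch the moved entry's index */
    (*count)--;
}
```

The trade-off is that swap-remove does not preserve element order, which is fine here since executor invalidation only needs to scan the whole set.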

Member

@markshannon markshannon left a comment


Thanks for doing this.
I've only had time to do a quick scan, but this looks like it should speed up the scan considerably.

-    _PyBloomFilter bloom;
-    _PyExecutorLinkListNode links;
+    int32_t bloom_array_idx;        // Index in interp->executor_blooms/executor_ptrs.
+    _PyExecutorLinkListNode links;  // Used by deletion list.
Member


Is this necessary now? We can traverse all executors using the executor_ptrs array.

Contributor Author


I think we still need it to keep the pending deletion list:

cpython/Python/optimizer.c

Lines 332 to 338 in 08a018e

static void
uop_dealloc(PyObject *op) {
_PyExecutorObject *self = _PyExecutorObject_CAST(op);
executor_invalidate(op);
assert(self->vm_data.code == NULL);
add_to_pending_deletion_list(self);
}
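The dealloc path above queues executors for later cleanup rather than freeing them immediately, which is where an intrusive linked node is still convenient: it needs no allocation and pushing is O(1). A minimal sketch of such a pending list (node and function names are illustrative, not CPython's actual API):

```c
#include <stddef.h>

/* Hypothetical intrusive singly linked node, embedded in each executor. */
typedef struct pending_node {
    struct pending_node *next;
} pending_node_t;

static pending_node_t *pending_head = NULL;  /* head of pending-deletion list */

/* Push onto the list: O(1), allocation-free, safe to call from dealloc. */
static void
add_to_pending(pending_node_t *n)
{
    n->next = pending_head;
    pending_head = n;
}
```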

@cocolato
Contributor Author

@Fidget-Spinner gentle ping, if you have time, please take a look at this, thanks!

@Fidget-Spinner
Member

Do you have benchmarks for this? A microbenchmark is fine.

@cocolato

This comment was marked as outdated.

@cocolato
Contributor Author

I wrote a microbench:

bench.py:

import time
N = 1000
ROUNDS = 10
for r in range(ROUNDS):
    classes = []
    for i in range(N):
        cls = type(f"C{i}", (), {"val": i})
        ns = {"cls": cls}
        exec("def f(n):\n o=cls()\n s=0\n for j in range(n): s+=o.val\n return s", ns)
        classes.append((cls, ns["f"]))
    for _, f in classes:
        for _ in range(200):
            f(10)
    t0 = time.perf_counter_ns()
    for cls, _ in classes:
        cls.val = -1
    elapsed = time.perf_counter_ns() - t0
    print(f"round {r}: {elapsed / 1e3:.1f} us  ({elapsed // N} ns/scan)")

test.sh:

export PYTHON_JIT_STRESS=1
echo "=== Baseline (linked list) ===" 
./python_base.exe /tmp/bench_bloom.py 
echo && echo "=== Optimized (contiguous array) ==="
./python.exe /tmp/bench_bloom.py

result:

=== Baseline (linked list) ===
round 0: 5111.4 us  (5111 ns/scan)
round 1: 6275.0 us  (6274 ns/scan)
round 2: 5421.2 us  (5421 ns/scan)
round 3: 5388.6 us  (5388 ns/scan)
round 4: 6240.9 us  (6240 ns/scan)
round 5: 6356.5 us  (6356 ns/scan)
round 6: 6139.0 us  (6139 ns/scan)
round 7: 6383.8 us  (6383 ns/scan)
round 8: 5474.0 us  (5474 ns/scan)
round 9: 6461.1 us  (6461 ns/scan)

=== Optimized (contiguous array) ===
round 0: 2657.8 us  (2657 ns/scan)
round 1: 2792.6 us  (2792 ns/scan)
round 2: 2861.5 us  (2861 ns/scan)
round 3: 2766.0 us  (2765 ns/scan)
round 4: 2650.5 us  (2650 ns/scan)
round 5: 2689.4 us  (2689 ns/scan)
round 6: 2691.5 us  (2691 ns/scan)
round 7: 2769.4 us  (2769 ns/scan)
round 8: 2864.1 us  (2864 ns/scan)
round 9: 2786.0 us  (2786 ns/scan)

@cocolato
Contributor Author

However, since the time spent on the scan is too small compared to warmup, I did not observe any noticeable performance improvement in fastmark.

@Fidget-Spinner
Member

That's an excellent result!
